Goals:
Overview of factors
Factors are “truly categorical” variables. They are vectors that:
We will use the gapminder dataset to explore working with factors in the following ways:
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(gapminder))
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
This confirms that we have two factors, country (142 levels), and continent (5 levels).
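As a quick illustration of what "factor" means here, the sketch below builds a small factor by hand and inspects its parts. The vector `continent` and its values are hypothetical toy data, not rows taken from gapminder.

```r
# Toy factor (hypothetical values, not taken from gapminder)
continent <- factor(c("Asia", "Europe", "Asia", "Africa"))

levels(continent)      # the distinct categories, sorted alphabetically by default
as.integer(continent)  # the integer codes that index into levels()
nlevels(continent)     # the number of levels
```

Each element is stored as an integer code, and the labels live once in the levels attribute, which is why `str()` prints values such as `1 1 1 ...` next to the level names.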
Suppose we wanted to remove observations in the continent of Oceania from the dataset.
To start, we should review the names of the levels using the levels() function:
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Next we filter the data. We can remove unused levels at the same time by piping the filter results into the droplevels() function:
gap_sans_OCE <- gapminder %>%
filter(continent != "Oceania") %>%
droplevels()
Now let’s confirm there are no longer any “Oceania” rows:
NROW(gap_sans_OCE %>%
filter(continent == "Oceania"))
## [1] 0
Looks good! We can also check how many rows were dropped:
# Subtract to find difference
NROW(gapminder) - NROW(gap_sans_OCE)
## [1] 24
We have now confirmed that all “Oceania” rows (n=24) have been removed from the original dataset. What about the levels?
levels(gap_sans_OCE$continent)
## [1] "Africa" "Americas" "Asia" "Europe"
Great! We have successfully removed the unused levels.
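To see why the droplevels() step matters, here is a minimal base R sketch on a hypothetical toy factor. Subsetting alone keeps the old level in the levels attribute (dplyr's filter() behaves the same way in this respect), so it would still show up in tables and plot legends with a count of zero.

```r
# Toy factor (hypothetical values)
continent <- factor(c("Africa", "Asia", "Oceania", "Asia"))

# Subsetting removes the rows but not the level
kept <- continent[continent != "Oceania"]
levels(kept)   # "Oceania" is still listed
table(kept)    # ... and is counted with zero observations

# droplevels() discards levels that no longer occur
trimmed <- droplevels(kept)
levels(trimmed)
```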
country or continent
Suppose we wanted to plot the mean population for each continent:
mean_pop <- gapminder %>%
group_by(continent) %>%
summarize(mean.pop = mean(pop))
knitr::kable(mean_pop)
| continent | mean.pop |
|---|---|
| Africa | 9916003 |
| Americas | 24504795 |
| Asia | 77038722 |
| Europe | 17169765 |
| Oceania | 8874672 |
Let’s plot the summarized data:
ggplot(mean_pop, aes(continent, mean.pop)) +
geom_point()
By default, the levels are plotted in alphabetical order, which does not help to highlight any patterns in the data.
We can reorder them by ascending mean population using the fct_reorder() function:
mean_pop %>%
mutate(continent = fct_reorder(continent, mean.pop)) %>%
ggplot(aes(continent, mean.pop)) +
geom_point()
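The reordering can also be checked without drawing a plot. The sketch below rebuilds the summary table by hand (the means are copied from the table above; the names `asc` and `desc` are only for illustration) and compares the level order that fct_reorder() produces with and without its `.desc` argument.

```r
library(forcats)

# Continent means copied from the summary table above
mean_pop <- data.frame(
  continent = factor(c("Africa", "Americas", "Asia", "Europe", "Oceania")),
  mean.pop  = c(9916003, 24504795, 77038722, 17169765, 8874672)
)

# Ascending order is the default; .desc = TRUE reverses it
asc  <- fct_reorder(mean_pop$continent, mean_pop$mean.pop)
desc <- fct_reorder(mean_pop$continent, mean_pop$mean.pop, .desc = TRUE)

levels(asc)   # smallest mean population first
levels(desc)  # largest mean population first
```

Only the level order changes. The rows of the data frame stay where they were, which is why the effect shows up on the plot axis rather than in the printed table.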
Let’s practice exporting and importing data. Say we wanted to look at life expectancies in countries in the Americas with a population over 5,000,000 for the year 2007:
America_pop_2007 <- gapminder %>%
filter(continent == "Americas", year == 2007, pop > 5000000) %>%
arrange(pop) # arrange in ascending order
Let’s view it as a table:
knitr::kable(America_pop_2007)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Nicaragua | Americas | 2007 | 72.899 | 5675356 | 2749.321 |
| Paraguay | Americas | 2007 | 71.752 | 6667147 | 4172.838 |
| El Salvador | Americas | 2007 | 71.878 | 6939688 | 5728.354 |
| Honduras | Americas | 2007 | 70.198 | 7483763 | 3548.331 |
| Haiti | Americas | 2007 | 60.916 | 8502814 | 1201.637 |
| Bolivia | Americas | 2007 | 65.554 | 9119152 | 3822.137 |
| Dominican Republic | Americas | 2007 | 72.235 | 9319622 | 6025.375 |
| Cuba | Americas | 2007 | 78.273 | 11416987 | 8948.103 |
| Guatemala | Americas | 2007 | 70.259 | 12572928 | 5186.050 |
| Ecuador | Americas | 2007 | 74.994 | 13755680 | 6873.262 |
| Chile | Americas | 2007 | 78.553 | 16284741 | 13171.639 |
| Venezuela | Americas | 2007 | 73.747 | 26084662 | 11415.806 |
| Peru | Americas | 2007 | 71.421 | 28674757 | 7408.906 |
| Canada | Americas | 2007 | 80.653 | 33390141 | 36319.235 |
| Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.380 |
| Colombia | Americas | 2007 | 72.889 | 44227550 | 7006.580 |
| Mexico | Americas | 2007 | 76.195 | 108700891 | 11977.575 |
| Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
| United States | Americas | 2007 | 78.242 | 301139947 | 42951.653 |
Let’s practice exporting to a comma separated values (.csv) file:
# set row.names to FALSE to avoid creating an extra ID column
write.csv(America_pop_2007, file = "America_pop_2007_5M.csv", row.names = FALSE)
We can see that the .csv was successfully exported to the project folder and the data are correct in Microsoft Excel.
Does its structure survive if we re-import it into R?
America_pop_2007_READ <- read.csv("America_pop_2007_5M.csv")
knitr::kable(America_pop_2007_READ)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Nicaragua | Americas | 2007 | 72.899 | 5675356 | 2749.321 |
| Paraguay | Americas | 2007 | 71.752 | 6667147 | 4172.838 |
| El Salvador | Americas | 2007 | 71.878 | 6939688 | 5728.354 |
| Honduras | Americas | 2007 | 70.198 | 7483763 | 3548.331 |
| Haiti | Americas | 2007 | 60.916 | 8502814 | 1201.637 |
| Bolivia | Americas | 2007 | 65.554 | 9119152 | 3822.137 |
| Dominican Republic | Americas | 2007 | 72.235 | 9319622 | 6025.375 |
| Cuba | Americas | 2007 | 78.273 | 11416987 | 8948.103 |
| Guatemala | Americas | 2007 | 70.259 | 12572928 | 5186.050 |
| Ecuador | Americas | 2007 | 74.994 | 13755680 | 6873.262 |
| Chile | Americas | 2007 | 78.553 | 16284741 | 13171.639 |
| Venezuela | Americas | 2007 | 73.747 | 26084662 | 11415.806 |
| Peru | Americas | 2007 | 71.421 | 28674757 | 7408.906 |
| Canada | Americas | 2007 | 80.653 | 33390141 | 36319.235 |
| Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.380 |
| Colombia | Americas | 2007 | 72.889 | 44227550 | 7006.580 |
| Mexico | Americas | 2007 | 76.195 | 108700891 | 11977.575 |
| Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
| United States | Americas | 2007 | 78.242 | 301139947 | 42951.653 |
Great! The data are still arranged by increasing population, so the row order survived the write-out/read-in process.
Note: To write or read a .csv outside of the project folder, you need to supply the full file path.
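Row order is only part of the structure. A .csv file stores plain text, so a custom factor level order is not written to it. The sketch below uses hypothetical toy data and temporary files to compare a CSV round trip with saveRDS()/readRDS(), which keeps the R object intact. Note also that since R 4.0 read.csv() returns character columns unless `stringsAsFactors = TRUE` is set.

```r
# Toy data with a deliberately non-alphabetical level order (hypothetical values)
df <- data.frame(
  country = factor(c("Peru", "Chile", "Cuba"), levels = c("Peru", "Chile", "Cuba")),
  pop     = c(3, 2, 1)
)

# Round trip through a .csv file: row order survives, level order does not
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
levels(from_csv$country)   # rebuilt alphabetically on import

# Round trip through an .rds file: the factor is stored as-is
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
from_rds <- readRDS(rds_path)
levels(from_rds$country)   # custom order preserved
```

If the CSV format is required, the level order can be restored after import with factor(x, levels = ...) or one of the forcats reordering functions.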
In this section I reproduce an old figure and give it some new life through the techniques I’ve learned in the last couple of weeks of lecture.
Below is the code and graph (“as is”) for a figure I was very proud of from Homework 02:
old_graph <- select(gapminder, gdpPercap, lifeExp, pop, year, continent) %>% #subsetting data
filter(year > 1957, continent=="Europe"|continent=="Africa") %>% #filter by criteria
ggplot(aes(lifeExp, gdpPercap))+ #piped to ggplot
geom_point(aes(color=continent, size=pop, alpha=0.1)) + # add aesthetics
xlab("Life Expectancy")+
ylab("GDP per Capita")
old_graph
new_graph <- select(gapminder, gdpPercap, lifeExp, pop, year, continent) %>%
  filter(year > 1957, continent == "Europe" | continent == "Africa") %>%
  ggplot(aes(lifeExp, gdpPercap)) +
  geom_point(aes(color = continent, size = pop, alpha = 0.1)) +
  labs(title = "Life expectancy for each continent",
       x = "Life Expectancy",
       y = "GDP per Capita") +
  theme_bw() +
  theme(axis.text.x = element_text(vjust = 0.6, size = 10),
        axis.text = element_text(size = 10)) +
  scale_colour_brewer(palette = "Set1") + # use brewer scheme
  geom_smooth(method = "lm") # add a linear regression line
new_graph
new_graph_plotly <- new_graph
ggplotly(new_graph_plotly)
One of the first changes I made was to remove the unnecessary comments, as these did not provide extra information and made the code difficult to read, contrary to the recommendations in Hadley Wickham's tidyverse style guide. Also in line with the style guide, I added spaces around each “=” and “+” to improve legibility.
I also added a linear regression line, which is particularly nice to include on the plotly plot, as it allows you to identify the predicted values from the model interactively. I could see this, along with the fact that plotly provides access to the specific attribute information for each point, being extremely useful for exploring and presenting data in the future.
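The predicted values shown on the regression line can also be computed directly. By default, geom_smooth(method = "lm") fits `y ~ x`, which for this plot corresponds to `gdpPercap ~ lifeExp`. The sketch below fits the same kind of model with lm() on a hypothetical toy data frame (`toy` is made up so that the arithmetic is easy to follow; it is not the filtered gapminder data).

```r
# Hypothetical toy data standing in for the filtered gapminder rows
toy <- data.frame(
  lifeExp   = c(40, 50, 60, 70, 80),
  gdpPercap = c(1000, 3000, 5000, 7000, 9000)
)

# Same model form that geom_smooth(method = "lm") uses by default (y ~ x)
fit <- lm(gdpPercap ~ lifeExp, data = toy)

coef(fit)                                          # intercept and slope
predict(fit, newdata = data.frame(lifeExp = 65))   # predicted GDP per capita at lifeExp = 65
```

Because `color = continent` is mapped inside geom_point() rather than in the top-level aes(), the smoother in the plot is fitted once across both continents, just as this single lm() call fits one line to all rows.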
With ggsave() it’s easy to export a figure as a file on your local computer:
ggsave("new_graph.png", new_graph)
## Saving 7 x 5 in image
We also may want to adjust the dimensions and resolution:
ggsave("new_graph_5x4.png", width = 5, height = 4, dpi = 150)
We can also save to a vector format such as .pdf:
ggsave("new_graph.pdf", new_graph)
## Saving 7 x 5 in image
The ggsave() function will, by default, save the most recently displayed plot. To save a different plot, we have to name it explicitly in the ggsave() call. For example, to save the old graph (see above), we would write:
ggsave("old_graph.png", plot = old_graph)
## Saving 7 x 5 in image
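Putting these options together, a minimal sketch of a fully explicit call might look like the following. The plot `p` and the file name `toy_plot.png` are hypothetical, and the file is written to a temporary directory so the example does not clutter the project folder.

```r
library(ggplot2)

# Hypothetical toy plot; any ggplot object can be passed to ggsave()
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x, y)) +
  geom_point()

# Name the plot, size, units and resolution explicitly
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = p, width = 5, height = 4, units = "in", dpi = 150)

file.exists(out_file)  # confirm the file was written
```

Passing `plot =` every time avoids depending on whichever plot happened to be displayed last, and setting `width` and `height` keeps the output size independent of the current graphics device.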